We investigate the effectiveness of generative adversarial networks (GANs) for speech enhancement, in the context of improving noise robustness of automatic speech recognition (ASR) systems. Prior work demonstrates that GANs can effectively suppress additive noise in raw waveform speech signals, improving perceptual quality metrics; however, this technique was not justified in the context of ASR. In this work, we conduct a detailed study to measure the effectiveness of GANs in enhancing speech contaminated by both additive and reverberant noise. Motivated by recent advances in image processing, we propose operating GANs on log-Mel filterbank spectra instead of waveforms, which requires less computation and is more robust to reverberant noise. While GAN enhancement improves the performance of a clean-trained ASR system on noisy speech, it falls short of the performance achieved by conventional multi-style training (MTR). By appending the GAN-enhanced features to the noisy inputs and retraining, we achieve a 7% WER improvement relative to the MTR system.
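The final step, appending GAN-enhanced features to the noisy inputs, can be read as a frame-by-frame concatenation along the feature axis. The sketch below is only an illustration under that assumption: the function name `append_enhanced`, the list-of-frames representation, and the toy two-bin frames are hypothetical and are not taken from the paper, which does not specify the feature dimensions or the implementation here.

```python
from typing import List

# One frame of log-Mel filterbank features (one float per Mel bin).
Frame = List[float]


def append_enhanced(noisy: List[Frame], enhanced: List[Frame]) -> List[Frame]:
    """Concatenate noisy and enhanced features frame by frame.

    Assumption: both inputs cover the same utterance, so they have the
    same number of frames. Each output frame is the noisy frame followed
    by the corresponding enhanced frame.
    """
    if len(noisy) != len(enhanced):
        raise ValueError("noisy and enhanced must have the same number of frames")
    return [n + e for n, e in zip(noisy, enhanced)]


# Toy values for illustration only: 2 frames with 2 bins each.
noisy_feats = [[1.0, 2.0], [3.0, 4.0]]
enhanced_feats = [[0.5, 1.5], [2.5, 3.5]]
combined = append_enhanced(noisy_feats, enhanced_feats)
print(len(combined), len(combined[0]))
```

Under this reading, the retrained acoustic model would see an input whose feature dimension is the sum of the noisy and enhanced dimensions, with the number of frames unchanged.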